[SPARK-8811][SQL] Read array struct data from parquet error #7209
Conversation
Test build #36489 has finished for PR 7209 at commit
@liancheng changing it like this resolved the Parquet query issue I sent to you, but it failed the unit test. Can you have a look?
Test build #36494 has finished for PR 7209 at commit
LGTM
@Sephiroth-Lin @scwf This issue is actually much more complicated than it looks. The TL;DR is that, in the early days, Parquet didn't explicitly specify how LIST and MAP should be constructed, and different systems and tools reinvented their own wheels. The consequence is that this breaks Parquet interoperability: Parquet files written by system A might not be readable by system B. The most recent Parquet format spec tries to fix this by specifying LIST and MAP structures explicitly and adding backwards-compatibility rules (1, 2) to cover existing legacy data files. We are trying to make Spark SQL compatible with the Parquet format spec. This work consists of three parts:
I'm currently working on part 2, which fixes your problem here. A PR will be sent out soon.
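For reference, here is a minimal sketch of the standard LIST and MAP layouts mandated by the Parquet format spec, parsed with parquet-mr's schema parser. The message name, the field names `xs`/`m`, and the `int32`/`binary` element types are placeholders of mine; only the `list`/`element` and `key_value`/`key`/`value` names come from the spec.

```scala
import org.apache.parquet.schema.MessageTypeParser

// Standard LIST: a 3-level structure whose repeated middle group is named
// "list" and whose innermost field is named "element".
val standardList = MessageTypeParser.parseMessageType(
  """message example {
    |  optional group xs (LIST) {
    |    repeated group list {
    |      optional int32 element;
    |    }
    |  }
    |}""".stripMargin)

// Standard MAP: a repeated group named "key_value" holding a required "key"
// and an (optionally nullable) "value".
val standardMap = MessageTypeParser.parseMessageType(
  """message example {
    |  optional group m (MAP) {
    |    repeated group key_value {
    |      required binary key (UTF8);
    |      optional int32 value;
    |    }
    |  }
    |}""".stripMargin)
```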
After rethinking this PR, I think it does spot another issue: the current master breaks backwards compatibility when reading Parquet files created by parquet-avro. When converting a Spark SQL schema to a Parquet schema, Spark 1.4.0 and prior versions mostly follow parquet-avro, and convert arrays which may contain null values into something like this:
Please note the field name. @Sephiroth-Lin Would you mind fixing this issue by changing the
@Sephiroth-Lin @scwf The aforementioned PR is here: #7231. A test case for SPARK-8811 is added.
Wow, that's cool!!
Do we still need to file a PR to change the `array_element` string to `array`?
@liancheng OK, good, thank you.
@liancheng I have updated, please help review, thank you!
Test build #36638 has finished for PR 7209 at commit
LGTM
@@ -490,7 +490,7 @@ private[parquet] class CatalystSchemaConverter(
         .buildGroup(repetition).as(LIST)
         .addField(
           Types.repeatedGroup()
-            .addField(convertField(StructField("element", elementType, containsNull)))
+            .addField(convertField(StructField("array", elementType, containsNull)))
This line shouldn't be changed. As commented above, this case branch implements the standard Parquet schema conversion following the Parquet format spec, which explicitly requires the innermost element type name to be `element`.
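As an illustration, here is a rough, self-contained sketch of the spec-compliant shape this case branch targets, assembled with the same parquet-mr `Types` builder API the diff uses (assuming the 1.7-era `OriginalType` API; the field name `f` and the `int32` element type are placeholders, and this is not the exact Spark code):

```scala
import org.apache.parquet.schema.{OriginalType, Type, Types}
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT32

// Build the standard LIST shape with the Types builder. The names "list" and
// "element" are the ones the Parquet format spec requires.
val standardList: Type =
  Types.buildGroup(Type.Repetition.OPTIONAL).as(OriginalType.LIST)
    .addField(
      Types.repeatedGroup()
        .addField(Types.optional(INT32).named("element"))
        .named("list"))
    .named("f")

println(standardList)
// optional group f (LIST) {
//   repeated group list {
//     optional int32 element;
//   }
// }
```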
Test build #36651 has finished for PR 7209 at commit
Test build #36653 has finished for PR 7209 at commit
Test build #36663 has finished for PR 7209 at commit
@@ -446,7 +446,7 @@ private[parquet] class CatalystSchemaConverter(
       field.name,
       Types
         .buildGroup(REPEATED)
-        .addField(convertField(StructField("element", elementType, nullable)))
+        .addField(convertField(StructField("array", elementType, nullable)))
Actually I made a mistake here. We should change this `array` to `array_element`.
This is a little bit complicated... In the early days, when Spark SQL's Parquet support was first authored, the Parquet format spec wasn't clear about how to write arrays and maps, so Spark SQL took a somewhat weird approach here: if the array may contain nulls, we mimic parquet-hive, which writes a 3-level structure with `array_element` as the 2nd-level type name; otherwise, we mimic parquet-avro, which writes a 2-level structure with `array` as the 2nd-level type name.
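Written out as raw Parquet schemas, the two legacy layouts just described look roughly like this (the column name `f`, the `int32` element type, and the middle group name `bag`, borrowed from parquet-hive, are my assumptions; `array_element` and `array` are the names from the comment):

```scala
import org.apache.parquet.schema.MessageTypeParser

// Nullable elements: parquet-hive style, 3 levels, "array_element" innermost.
val legacyNullableArray = MessageTypeParser.parseMessageType(
  """message spark_schema {
    |  optional group f (LIST) {
    |    repeated group bag {
    |      optional int32 array_element;
    |    }
    |  }
    |}""".stripMargin)

// Non-nullable elements: parquet-avro style, 2 levels, repeated "array" field.
val legacyNonNullableArray = MessageTypeParser.parseMessageType(
  """message spark_schema {
    |  optional group f (LIST) {
    |    repeated int32 array;
    |  }
    |}""".stripMargin)
```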
Just to be clear, PR #7231 already covers the original bug this PR tried to fix, so we'll be able to read Hive data in the legacy format. The field names changed here matter for the write path, because we want to write exactly the same format as older Spark SQL versions when compatible mode is turned on.
Hey @Sephiroth-Lin, do you mind me forking this PR branch and continuing the work on it (I will still credit you as the main author)? Parquet schema conversion is particularly hard to get right because there are a bunch of head-scratching historical compatibility issues :(
@liancheng OK, no problem. Thank you!
Cool, then would you mind closing this PR for now?
Opened #7304 to fix this issue.
@Sephiroth-Lin please close this PR.
…hen handling Parquet LISTs in compatible mode This PR is based on #7209 authored by Sephiroth-Lin. Author: Weizhong Lin <[email protected]> Closes #7304 from liancheng/spark-8928 and squashes the following commits: 75267fe [Cheng Lian] Makes CatalystSchemaConverter sticking to 1.4.x- when handling LISTs in compatible mode
…hen handling Parquet LISTs in compatible mode This PR is based on #7209 authored by Sephiroth-Lin. Author: Weizhong Lin <[email protected]> Closes #7314 from liancheng/spark-8928 and squashes the following commits: 75267fe [Cheng Lian] Makes CatalystSchemaConverter sticking to 1.4.x- when handling LISTs in compatible mode
@Sephiroth-Lin you should add the email you used for the commit to your GitHub profile. Then it will show up as your commit.
@Sephiroth-Lin BTW, I added your email address manually when merging #7314. (Failed to update the author field when merging this PR so I reverted this one and reopened it as #7314.) |
JIRA: https://issues.apache.org/jira/browse/SPARK-8811
so reading data from Parquet will cause a java.lang.ArrayIndexOutOfBoundsException.
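For context, a hypothetical reproduction sketch of the reported failure (the path and column name are made up for illustration; assumes a Spark 1.4-era `sqlContext` in scope, e.g. in spark-shell):

```scala
// Hypothetical reproduction: read a Parquet file whose array-of-struct column
// was written in one of the legacy LIST layouts described above.
val df = sqlContext.read.parquet("/tmp/legacy_list_table")  // assumed path
df.select("array_struct_col").collect()
// Before the fix, this failed with java.lang.ArrayIndexOutOfBoundsException.
```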